The next step is to fit \(RI_Q\) to statistical distributions and see whether any of the alternatives improves the scores. The selected alternative distributions are Logistic, Log Normal, and Gamma. The MASS package in R was used with maximum likelihood parameter estimation. Each score is calculated as 2\(*\)the lower-tail probability (if the value is below the median) or 2\(*\)the upper-tail probability (if above the median).
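The two-sided tail score described above can be sketched as follows for a fitted normal distribution. This is an illustrative Python sketch, not the document's actual code (the analysis itself used the MASS package in R), and the helper names are hypothetical.

```python
# Illustrative sketch of the two-sided tail score: twice the lower-tail
# probability below the median, twice the upper-tail probability above it.
# The actual analysis used R's MASS package; names here are hypothetical.
import math


def norm_cdf(x: float, mu: float, sigma: float) -> float:
    """Lower-tail probability of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def two_tail_score(x: float, mu: float, sigma: float) -> float:
    """2 * lower tail if x is below the median, 2 * upper tail if above.

    For a normal distribution the median equals mu, so the smaller of the
    two tails is always the one being doubled.
    """
    p_lower = norm_cdf(x, mu, sigma)
    return 2.0 * min(p_lower, 1.0 - p_lower)
```

A value at the fitted median scores 1.0, and the score shrinks toward 0 as the value moves into either tail, so the score is symmetric around the center of the fitted distribution.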
The next step is to quantify the uncertainty in the parameter estimates when fitting \(RI_Q\) to each distribution.
Figure 1: Maximum likelihood parameter estimates for each of the four test distributions (top row: gamma, second row: log normal, third row: logistic, last row: normal), where the point indicates the estimate and the lines indicate +/- 1 standard error, with the exception of gamma, for which both values are log transformed. Red indicates a point where the standard error of the estimate could not be calculated.
Parameter estimates for the normal, log normal, and logistic distributions are reasonable, with relatively small standard errors (Fig. 1). Gamma distribution estimates have high or missing (NA) standard errors, which is indicative of a poor fit.
The Kolmogorov-Smirnov (KS) test indicates how close the cumulative distribution functions of two distributions are (in this case, the actual \(RI_Q\) distribution and its estimated distribution) by taking the maximum vertical distance between the two curves.
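The KS statistic just described can be sketched directly: take the maximum vertical gap between the sample's empirical CDF and a candidate CDF. This is an illustrative Python sketch (the document's analysis itself was done in R), and the function name is hypothetical.

```python
# Minimal sketch of the KS statistic: the maximum vertical distance
# between the empirical CDF of a sample and a candidate CDF.
# Illustrative only; the document's analysis was done in R.


def ks_statistic(sample, cdf):
    """Return max |ECDF(x) - cdf(x)| over the sorted sample.

    The empirical CDF jumps from i/n to (i + 1)/n at the i-th sorted
    point, so both sides of each jump are checked.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d
```

In the analysis above, the candidate CDF would be the maximum-likelihood fit of each distribution family, so a lower statistic means the fitted family tracks the observed \(RI_Q\) distribution more closely.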
Figure 2: The Kolmogorov-Smirnov (KS) statistic of the distance between the query retention index \(RI_Q\) distribution and a modeled distribution using maximum likelihood estimation. The “best” label refers to the distribution with the lowest KS statistic, and the “other” label refers to a distribution with a p-value “within 0.05” of the best distribution.
The three non-normal distributions tend to have lower KS statistics than the normal distribution (Fig. 2), which may mean they fit the true retention index query distribution better. There is evidence to suggest that using the kernels of these distributions as an RI score may improve ranks more than adjusting the normal distribution score.
| Distribution | Count |
|--------------|-------|
| Logistic     | 64    |
| Log Normal   | 12    |
| Gamma        | 10    |
| Original     | 1     |
| Normal       | 0     |
The Akaike Information Criterion (AIC) estimates the prediction error and therefore the relative quality of the distributional fit; lower AICs indicate better fits. AICs were scaled within each row by subtracting the minimum AIC from every AIC in that row.
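The scaling step can be sketched as follows. This is an illustrative Python sketch under the standard AIC definition (AIC = 2k - 2 log-likelihood); `normal_aic` is a hypothetical helper for one distribution family, not the document's actual code, which was in R.

```python
# Sketch of AIC for one family (normal, k = 2 parameters) and the
# per-row scaling described above. Illustrative; names are hypothetical.
import math


def normal_aic(sample, mu, sigma):
    """AIC = 2k - 2 * log-likelihood under N(mu, sigma^2), with k = 2."""
    loglik = sum(
        -0.5 * math.log(2.0 * math.pi * sigma**2)
        - (x - mu) ** 2 / (2.0 * sigma**2)
        for x in sample
    )
    return 2 * 2 - 2.0 * loglik


def scale_to_min(aics):
    """Scale a row of AICs so the best (lowest) fit sits at 0."""
    m = min(aics)
    return [a - m for a in aics]
```

After scaling, each row's best-fitting distribution has a value of 0, and every other value is that distribution's AIC distance from the best fit, which is what the binning in the figure operates on.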
Figure 3: The Akaike Information Criterion (AIC) statistic of the distance between the \(RI_Q\) distribution and an estimate of that distribution using maximum likelihood estimation. AIC has been scaled to the minimum AIC per metabolite and then binned as listed in the legend above. Gray indicates a distribution for which the estimated parameters failed to calculate.
The AIC statistic also provides evidence that other distributions may fit \(RI_Q\) better than the normal (Fig. 3). According to the AIC statistic, the normal may be an appropriate estimate more often than the KS statistic suggests. The most interesting finding from the AIC statistic is that the logistic fit is rarely far removed from being the best fit.
Figure 4: The proportion of identifications at rank 1 for each scoring method. Gamma, Log Normal, Logistic, Normal Kernel (original RI score), and Normal Probability results are calculated with the 10% holdout analysis.
Overall, retention index scores that use values from each \(RI_Q\) distribution tend to perform better than the original score for our holdout analysis on the full dataset for true positives at rank 1 (Fig. 4).
Figure 5: Proportions of true positives (left two) and true negatives (right two) at ranks 1 and 5 per metabolite, following the 10% holdout analysis with all input data.
We see that the other scores tend to outperform the original at ranks 1 and 5 for both true positives and true negatives.
Figure 6: The median standard deviation of rank following our 10% holdout analysis that was run 50 times, for all metabolites in our subset (n = 87), broken down by sample type and separated by true positives and negatives.
Variation in rank performance may be driven more by sample type than by the score used (Fig. 6).
Figure 7: The distributional properties of the query retention indices for the best-performing scores at rank 5, separated by whether the original score or any other score performed better, shown for the mean and standard deviation. Ties were removed.
Metabolites whose \(RI_Q\) distributions have more extreme means and standard deviations, and less extreme skew, perform better with the other scores than with the original RI score (Fig. 7).
Figure 8: Estimated distributions for each score overlaid on the retention index query distribution, centered to each metabolite’s reference retention index. Representative metabolites with the best performance (highest true positive proportion or lowest true negative proportion) for the original score (left column) or another distribution (right column) were selected for both true positives (top row) and true negatives (bottom row).
When the original score’s assumed distribution fits the \(RI_Q\) distribution well, it performs better than in cases where the assumed distribution is a poor fit (Fig. 8).
The final question is whether the results change if we run the holdout analysis with only our subset metabolites and all of their true positive, true negative, and unknown annotations.
Figure 9: Proportions of true positives (left two) and true negatives (right two) at ranks 1 and 5 per metabolite, following the 10% holdout analysis with only metabolites within our subset.
If we restrict our holdout analysis to just the compounds in our subset, the other scores still outperform the original (Fig. 9).
| Truth         | Rank | Original | Original Adjusted | Gamma  | Log Normal | Logistic | Normal |
|---------------|------|----------|-------------------|--------|------------|----------|--------|
| True Positive | 1    | 0.0508   | 0.214             | 0.215  | 0.213      | 0.242    | 0.213  |
| True Positive | 5    | 0.370    | 0.742             | 0.742  | 0.740      | 0.700    | 0.741  |
| True Negative | 1    | 0.0084   | 0.0088            | 0.0086 | 0.0087     | 0.0092   | 0.0087 |
| True Negative | 5    | 0.0793   | 0.055             | 0.0558 | 0.0548     | 0.063    | 0.0546 |
| Truth         | Rank | Original | Original Adjusted | Gamma  | Log Normal | Logistic | Normal |
|---------------|------|----------|-------------------|--------|------------|----------|--------|
| True Positive | 1    | 0.487    | 0.785             | 0.784  | 0.784      | 0.810    | 0.785  |
| True Positive | 5    | 0.896    | 0.999             | 0.999  | 0.999      | 0.998    | 0.999  |
| True Negative | 1    | 0.129    | 0.101             | 0.0704 | 0.0694     | 0.0859   | 0.0696 |
| True Negative | 5    | 0.764    | 0.749             | 0.744  | 0.744      | 0.755    | 0.744  |